Modelling and Optimizing on Syntactic N-Grams for Statistical Machine Translation
Author
Abstract
The role of language models in SMT is to promote fluent translation output, but traditional n-gram language models are unable to capture fluency phenomena between distant words, such as some morphological agreement phenomena, subcategorisation, and syntactic collocations with string-level gaps. Syntactic language models have the potential to fill this modelling gap. We propose a language model for dependency structures that is relational rather than configurational and thus particularly suited for languages with a (relatively) free word order. It is trainable with Neural Networks, and not only improves over standard n-gram language models, but also outperforms related syntactic language models. We empirically demonstrate its effectiveness in terms of perplexity and as a feature function in string-to-tree SMT from English to German and Russian. We also show that using a syntactic evaluation metric to tune the log-linear parameters of an SMT system further increases translation quality when coupled with a syntactic language model.
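The contrast the abstract draws can be made concrete with a toy sketch (hypothetical sentence, parse, and helper names; not the paper's implementation): a trigram model conditions only on the two preceding surface words, while a relational dependency model can condition on a word's syntactic head even when it is arbitrarily far away in the string.

```python
# Toy sketch: surface n-gram context vs. relational dependency context.
# The sentence, head indices, and function names are illustrative assumptions.

sentence = ["the", "proposals", "that", "were", "discussed", "seem", "reasonable"]
# Head index for each token (-1 = root), from a hypothetical dependency parse:
heads = [1, 5, 4, 4, 1, -1, 5]

def ngram_context(i, n=3):
    """Context a trigram LM sees for token i: the two preceding surface words."""
    return tuple(sentence[max(0, i - n + 1):i])

def dependency_context(i):
    """Relational context: the token's syntactic head, regardless of distance."""
    h = heads[i]
    return sentence[h] if h >= 0 else "<root>"

# For the verb "seem" (index 5), the trigram window contains only the
# intervening relative clause material, not the agreeing subject:
trigram_ctx = ngram_context(5)        # ('were', 'discussed')
# The dependency link from "proposals" to its head "seem" captures the
# subject-verb agreement relation the n-gram window misses:
head_of_subject = dependency_context(1)
```

This is why string-level gaps (here, the relative clause between subject and verb) defeat fixed-order n-gram models but are unproblematic for a model defined over dependency relations.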
Similar resources
Automatic Phrase Alignment Using statistical n-gram alignment for syntactic phrase alignment
A parallel treebank consists of syntactically annotated sentences in two or more languages, taken from translated (i.e. parallel) documents. These parallel sentences are linked through alignment. Much work has been done on sentence and word alignment, but not as much on the intermediate level. This paper explores using n-gram alignment created for statistical machine translation based on GIZA++...
Shallow-Syntax Phrase-Based Translation: Joint versus Factored String-to-Chunk Models
This work extends phrase-based statistical MT (SMT) with shallow syntax dependencies. Two string-to-chunk translation models are proposed: a factored model, which augments phrase-based SMT with layered dependencies, and a joint model, which extends the phrase translation table with microtags, i.e. per-word projections of chunk labels. Both rely on n-gram models of target sequences with different...
Revisiting the Case for Explicit Syntactic Information in Language Models
Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naïve, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactical...
A Dialogue Analysis Model With Statistical Speech Act Processing For Dialogue Machine Translation
In some cases, to translate an utterance in a dialogue properly, the system needs various kinds of contextual information. In this paper, we propose a statistical dialogue analysis model based on speech acts for Korean-English dialogue machine translation. The model uses syntactic patterns and N-grams reflecting the hierarchical discourse structures of dialogues. The syntactic pattern inclu...
Tackling Sparse Data Issue in Machine Translation Evaluation
We illustrate and explain problems of n-gram-based machine translation (MT) metrics (e.g. BLEU) when applied to morphologically rich languages such as Czech. A novel metric, SemPOS, based on the deep-syntactic representation of the sentence, tackles the issue and retains the performance for translation into English as well.
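The sparsity problem the snippet describes can be illustrated with a toy calculation (the Czech example sentences and the helper name are illustrative assumptions, not from the paper): when every surface form differs by inflection, clipped n-gram precision, the core quantity behind BLEU, drops to zero even for an adequate translation.

```python
# Toy sketch: surface n-gram overlap vs. morphological variation.
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core quantity behind BLEU."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return matched / max(1, len(cand))

# Two Czech-like renderings with the same meaning (illustrative examples):
ref = "nová kniha byla napsána".split()   # passive: "a new book was written"
hyp = "novou knihu napsali".split()       # active paraphrase, inflected forms
# Every surface form differs by inflection, so bigram precision is zero,
# whereas a metric over lemmas or deep-syntactic labels could still match.
```

This is the motivation for moving the metric from surface n-grams to a deep-syntactic representation, where inflectional variants collapse onto the same units.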
Journal: TACL
Volume 3, Issue
Pages -
Publication date: 2015